Project: GapMinder Education Data Analysis¶

Introduction¶

I am going to work on GapMinder Education data. Solely education data doesn't give the much needed insights hence I am going to look at Economy and Employability Indicators as well as get insights on the relationship among each other.

We will be working with the following research questions and get data accordingly from GapMinder, our data source.¶

  1. What is the gender ratio of students who enroll in primary and secondary education for least 5 gross income countries?
  1. Does employement ratio get affected based on more no of years spent in school?

Table of Contents¶

  • Introduction
  • Data Wrangling
  • Exploratory Data Analysis
    • Research Question 1
    • Research Question 2
  • Conclusions
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

%matplotlib inline

Data Wrangling¶

Lets first get data for all the posed research questions!¶

Go back to top¶

Gender Ratio of enrollment in school¶

The data has the gender parity index for gross enrollment in primary and secondary education. This is the ratio of girls to boys enrolled at primary and secondary grades of public and private schools.
Source : World Bank
No of countries : 204

In [2]:
df_gend_ratio_pr_sec_enrollment = pd.read_csv("./Data/ratio_of_girls_to_boys_in_primary_and_secondary_education_perc.csv")
print("Total No of rows : ", df_gend_ratio_pr_sec_enrollment.shape[0])
print("Total No of columns : ", df_gend_ratio_pr_sec_enrollment.shape[1])
df_gend_ratio_pr_sec_enrollment.head()
Total No of rows :  204
Total No of columns :  53
Out[2]:
country 1970 1971 1972 1973 1974 1975 1976 1977 1978 ... 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
0 Aruba NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 1.020 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Afghanistan 0.167 0.161 0.161 0.169 0.167 0.174 0.181 0.192 0.199 ... 0.669 0.655 0.654 0.642 0.646 0.642 0.636 NaN NaN NaN
2 Angola NaN 0.640 0.657 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Albania NaN NaN NaN NaN NaN NaN 0.923 NaN 0.925 ... 0.990 0.982 0.977 0.982 0.994 1.000 1.020 1.02 1.02 NaN
4 Andorra NaN NaN NaN NaN NaN 1.150 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 53 columns

Gender Ratio of Number of Years in School¶

The data consists of percentage of ratio of years spent by females to males in primary,secondary and tertiary education. It is collected for people of ages in the range of 25 to 34.
Source : Institute for Health Metrics and Evaluation (IHME), University of Washington
No of countries : 188

In [3]:
df_gend_ratio_pr_sec_ter_yrs = pd.read_csv("./Data/mean_years_in_school_women_percent_men_25_to_34_years.csv")
print("Total No of rows : ", df_gend_ratio_pr_sec_ter_yrs.shape[0])
print("Total No of columns : ", df_gend_ratio_pr_sec_ter_yrs.shape[1])
df_gend_ratio_pr_sec_ter_yrs.head()
Total No of rows :  188
Total No of columns :  47
Out[3]:
country 1970 1971 1972 1973 1974 1975 1976 1977 1978 ... 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
0 Afghanistan 15.4 15.8 15.4 15.6 15.9 16.1 16.4 16.6 16.2 ... 21.5 21.9 22.2 22.3 22.6 22.9 23.1 23.4 23.5 23.7
1 Angola 51.3 51.4 51.9 52.3 52.8 53.2 53.4 53.8 54.3 ... 68.5 68.9 69.5 70.1 70.5 71.2 71.7 72.2 72.9 73.3
2 Albania 87.4 87.9 88.3 88.9 89.2 89.7 90.2 90.6 91.0 ... 100.0 101.0 101.0 101.0 101.0 102.0 102.0 102.0 102.0 103.0
3 Andorra 97.0 97.4 97.8 98.1 98.4 98.8 99.1 99.5 99.8 ... 105.0 105.0 105.0 105.0 105.0 105.0 106.0 106.0 106.0 106.0
4 United Arab Emirates 90.9 91.4 92.0 92.4 93.0 93.6 94.0 94.5 95.2 ... 105.0 105.0 105.0 105.0 105.0 105.0 105.0 105.0 106.0 106.0

5 rows × 47 columns

Some percentage data points were > 100, firstly I concluded that those must be incorrect and need to be cleaned. On delving deep to the source World Bank website, I found and I quote - "There are many reasons why the primary completion rate can exceed 100 percent. The numerator may include late entrants and overage children who have repeated one or more grades of primary education as well as children who entered school early, while the denominator is the number of children at the entrance age for the last grade of primary education." Hence, this isn't incorrect data and needs no corrections.¶

Employment Rate (15+ aged)¶

The data consists the percentage of all 15+ aged people who were employed that year.
Source : International Labour Organization
No of countries : 189

In [4]:
df_emp = pd.read_csv("./Data/aged_15plus_employment_rate_percent.csv")
print("Total No of rows : ", df_emp.shape[0])
print("Total No of columns : ", df_emp.shape[1])
df_emp.head()
Total No of rows :  189
Total No of columns :  31
Out[4]:
country 1991 1992 1993 1994 1995 1996 1997 1998 1999 ... 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
0 Afghanistan 42.5 42.5 42.5 42.5 42.4 42.4 42.3 42.2 42.2 ... 42.3 42.4 42.5 42.7 42.9 43.0 43.2 43.4 43.5 41.5
1 Angola 75.0 75.0 75.2 75.1 74.9 74.9 74.8 74.7 74.6 ... 71.7 71.8 71.8 71.9 71.9 72.0 72.1 72.1 72.1 69.6
2 Albania 57.8 58.2 56.8 55.7 54.1 53.3 54.5 53.8 52.7 ... 52.0 49.4 44.7 43.7 46.0 47.9 49.3 52.0 53.4 52.7
3 United Arab Emirates 71.8 72.2 72.9 73.4 73.8 73.3 73.1 73.3 73.7 ... 81.7 81.5 81.3 81.3 81.6 81.2 80.3 80.3 80.2 76.9
4 Argentina 57.3 56.9 54.9 54.0 49.5 50.7 52.5 54.1 53.1 ... 56.3 56.1 56.0 55.4 55.5 55.5 55.5 55.7 55.5 49.4

5 rows × 31 columns

Income per person data¶

The data consists of GDP per per capita which is calculated by the total amount (international dollars, fixed to 2017 prices) divided by the total population of the country.The data is adjusted for inflation and differences in the cost of living between countries, known as PPP dollars.
Source : Gapminder based on World Bank
No ofcountries : 210

In [5]:
df_income_per_person = pd.read_csv("./Data/income_per_person_gdppercapita_ppp_inflation_adjusted.csv")
print("Total No of rows : ", df_income_per_person.shape[0])
print("Total No of columns : ", df_income_per_person.shape[1])
df_income_per_person.head()
Total No of rows :  195
Total No of columns :  252
Out[5]:
country 1800 1801 1802 1803 1804 1805 1806 1807 1808 ... 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050
0 Afghanistan 683 683 683 683 683 683 683 683 683 ... 2690 2750 2810 2870 2930 2990 3060 3120 3190 3260
1 Angola 700 702 705 709 711 714 718 721 725 ... 8000 8170 8350 8530 8710 8900 9090 9280 9480 9690
2 Albania 755 755 755 755 755 756 756 756 756 ... 25.1k 25.6k 26.2k 26.7k 27.3k 27.9k 28.5k 29.1k 29.7k 30.4k
3 Andorra 1360 1360 1360 1360 1370 1370 1370 1370 1380 ... 68.9k 70.4k 71.9k 73.4k 75k 76.6k 78.3k 80k 81.7k 83.4k
4 United Arab Emirates 1130 1130 1140 1140 1150 1150 1160 1160 1160 ... 101k 103k 105k 107k 110k 112k 114k 117k 119k 122k

5 rows × 252 columns

Data Cleaning¶

Go back to top¶

Firstly lets look at the datatypes of all dataframes and convert them to proper datatypes

In [6]:
print(df_gend_ratio_pr_sec_enrollment.info())
df_gend_ratio_pr_sec_enrollment.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 53 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   country  204 non-null    object 
 1   1970     34 non-null     float64
 2   1971     116 non-null    float64
 3   1972     111 non-null    float64
 4   1973     108 non-null    float64
 5   1974     102 non-null    float64
 6   1975     98 non-null     float64
 7   1976     105 non-null    float64
 8   1977     107 non-null    float64
 9   1978     101 non-null    float64
 10  1979     99 non-null     float64
 11  1980     95 non-null     float64
 12  1981     105 non-null    float64
 13  1982     99 non-null     float64
 14  1983     100 non-null    float64
 15  1984     105 non-null    float64
 16  1985     101 non-null    float64
 17  1986     109 non-null    float64
 18  1987     101 non-null    float64
 19  1988     101 non-null    float64
 20  1989     100 non-null    float64
 21  1990     99 non-null     float64
 22  1991     100 non-null    float64
 23  1992     98 non-null     float64
 24  1993     108 non-null    float64
 25  1994     100 non-null    float64
 26  1995     93 non-null     float64
 27  1996     88 non-null     float64
 28  1997     70 non-null     float64
 29  1998     86 non-null     float64
 30  1999     142 non-null    float64
 31  2000     136 non-null    float64
 32  2001     134 non-null    float64
 33  2002     134 non-null    float64
 34  2003     130 non-null    float64
 35  2004     143 non-null    float64
 36  2005     147 non-null    float64
 37  2006     138 non-null    float64
 38  2007     141 non-null    float64
 39  2008     137 non-null    float64
 40  2009     135 non-null    float64
 41  2010     135 non-null    float64
 42  2011     141 non-null    float64
 43  2012     136 non-null    float64
 44  2013     123 non-null    float64
 45  2014     132 non-null    float64
 46  2015     133 non-null    float64
 47  2016     129 non-null    float64
 48  2017     127 non-null    float64
 49  2018     123 non-null    float64
 50  2019     115 non-null    float64
 51  2020     56 non-null     float64
 52  2021     3 non-null      float64
dtypes: float64(52), object(1)
memory usage: 84.6+ KB
None
Out[6]:
1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 ... 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
count 34.000000 116.000000 111.000000 108.000000 102.000000 98.000000 105.000000 107.000000 101.000000 99.000000 ... 136.000000 123.000000 132.000000 133.000000 129.000000 127.000000 123.000000 115.000000 56.000000 3.000000
mean 0.846285 0.774966 0.822279 0.837435 0.833902 0.863673 0.852486 0.858607 0.862950 0.876051 ... 0.981993 0.990081 0.986591 0.985662 0.988039 0.990929 0.993260 1.002670 1.010268 1.024333
std 0.268854 0.236902 0.241692 0.231280 0.230254 0.222574 0.216638 0.211052 0.203725 0.196936 ... 0.069261 0.066218 0.069373 0.070197 0.065811 0.060272 0.055941 0.043322 0.043691 0.045567
min 0.052700 0.161000 0.150000 0.169000 0.167000 0.174000 0.181000 0.192000 0.199000 0.436000 ... 0.669000 0.655000 0.654000 0.642000 0.646000 0.642000 0.636000 0.728000 0.890000 0.973000
25% 0.785000 0.579000 0.624500 0.655000 0.657000 0.673500 0.667000 0.693000 0.707000 0.713000 ... 0.974500 0.977000 0.979750 0.979000 0.982000 0.983000 0.985500 0.988000 0.989750 1.006500
50% 0.924000 0.857000 0.920000 0.937000 0.926000 0.939000 0.932000 0.936000 0.929000 0.949000 ... 0.995500 0.998000 1.000000 1.000000 1.000000 1.000000 1.000000 1.010000 1.010000 1.040000
75% 1.007250 0.984250 0.994500 1.010000 1.000000 1.010000 1.010000 1.010000 1.010000 1.010000 ... 1.020000 1.020000 1.020000 1.020000 1.020000 1.015000 1.010000 1.020000 1.022500 1.050000
max 1.450000 1.110000 1.440000 1.420000 1.180000 1.410000 1.400000 1.420000 1.440000 1.430000 ... 1.130000 1.150000 1.150000 1.130000 1.100000 1.120000 1.120000 1.140000 1.150000 1.060000

8 rows × 52 columns

All year columns have gender ratio for primary to secondary schools enrollment data as floats for all countries. So, its good to go. Next, lets deal with NULL values

In [7]:
df_gend_ratio_pr_sec_enrollment.isna().sum()
Out[7]:
country      0
1970       170
1971        88
1972        93
1973        96
1974       102
1975       106
1976        99
1977        97
1978       103
1979       105
1980       109
1981        99
1982       105
1983       104
1984        99
1985       103
1986        95
1987       103
1988       103
1989       104
1990       105
1991       104
1992       106
1993        96
1994       104
1995       111
1996       116
1997       134
1998       118
1999        62
2000        68
2001        70
2002        70
2003        74
2004        61
2005        57
2006        66
2007        63
2008        67
2009        69
2010        69
2011        63
2012        68
2013        81
2014        72
2015        71
2016        75
2017        77
2018        81
2019        89
2020       148
2021       201
dtype: int64

What we observe is, most of the recent years i.e. 2020 and old data i.e. before 1998 have most null values and shouldn't be considered while plotting as the analytics won't be correct. But for now, lets replace all rows with NULL values with the mean for each country and remove those countries with all NULL data.

In [8]:
#replace all countries with its individual means, I am doing it row-wise since the trend will match for each country and not year.
df_gend_ratio_pr_sec_enrollment.iloc[:,1:] = df_gend_ratio_pr_sec_enrollment.iloc[:,1:].apply(lambda row: row.fillna(row.mean()), axis=1)
df_gend_ratio_pr_sec_enrollment
# since old data doesn't give any useful picture, I am picking up 2000 to 2020 data.
year_list = list(map(str, range(2000, 2021)))
year_list.insert(0, 'country')
df_gend_ratio_pr_sec_enrollment = df_gend_ratio_pr_sec_enrollment[year_list]
df_gend_ratio_pr_sec_enrollment.head()
Out[8]:
country 2000 2001 2002 2003 2004 2005 2006 2007 2008 ... 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
0 Aruba 0.993000 0.999000 0.989000 0.977000 0.953000 0.963000 0.968000 0.987000 1.010 ... 1.010 1.020000 0.992071 0.992071 0.992071 0.992071 0.992071 0.992071 0.992071 0.992071
1 Afghanistan 0.437389 0.000000 0.437389 0.544000 0.406000 0.548000 0.573000 0.567000 0.586 ... 0.663 0.669000 0.655000 0.654000 0.642000 0.646000 0.642000 0.636000 0.437389 0.437389
2 Angola 0.754889 0.754889 0.754889 0.754889 0.754889 0.754889 0.754889 0.754889 0.833 ... 0.630 0.754889 0.754889 0.754889 0.754889 0.754889 0.754889 0.754889 0.754889 0.754889
3 Albania 0.976000 0.987000 0.974195 0.974195 0.980000 0.978000 0.985000 0.997000 1.000 ... 1.000 0.990000 0.982000 0.977000 0.982000 0.994000 1.000000 1.020000 1.020000 1.020000
4 Andorra 1.090000 1.090000 1.090000 1.090000 1.090000 1.090000 1.090000 1.090000 1.090 ... 1.090 1.090000 1.090000 1.090000 1.090000 1.090000 1.090000 1.090000 1.090000 1.090000

5 rows × 22 columns

In [9]:
null_count = df_gend_ratio_pr_sec_enrollment.isna().sum(axis = 1).to_frame()
print("Count of countries with all NULL values : ",len(null_count[null_count[0] == df_gend_ratio_pr_sec_enrollment.shape[1]-1]))

null_indexes = null_count[null_count[0] == df_gend_ratio_pr_sec_enrollment.shape[1]-1].index
if null_indexes.size > 0:
    df_gend_ratio_pr_sec_enrollment.drop(null_indexes,  inplace = True).reset_index(drop = True)
df_gend_ratio_pr_sec_enrollment.head()
Count of countries with all NULL values :  0
Out[9]:
country 2000 2001 2002 2003 2004 2005 2006 2007 2008 ... 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
0 Aruba 0.993000 0.999000 0.989000 0.977000 0.953000 0.963000 0.968000 0.987000 1.010 ... 1.010 1.020000 0.992071 0.992071 0.992071 0.992071 0.992071 0.992071 0.992071 0.992071
1 Afghanistan 0.437389 0.000000 0.437389 0.544000 0.406000 0.548000 0.573000 0.567000 0.586 ... 0.663 0.669000 0.655000 0.654000 0.642000 0.646000 0.642000 0.636000 0.437389 0.437389
2 Angola 0.754889 0.754889 0.754889 0.754889 0.754889 0.754889 0.754889 0.754889 0.833 ... 0.630 0.754889 0.754889 0.754889 0.754889 0.754889 0.754889 0.754889 0.754889 0.754889
3 Albania 0.976000 0.987000 0.974195 0.974195 0.980000 0.978000 0.985000 0.997000 1.000 ... 1.000 0.990000 0.982000 0.977000 0.982000 0.994000 1.000000 1.020000 1.020000 1.020000
4 Andorra 1.090000 1.090000 1.090000 1.090000 1.090000 1.090000 1.090000 1.090000 1.090 ... 1.090 1.090000 1.090000 1.090000 1.090000 1.090000 1.090000 1.090000 1.090000 1.090000

5 rows × 22 columns

In [10]:
print(df_gend_ratio_pr_sec_ter_yrs.info())
df_gend_ratio_pr_sec_ter_yrs.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188 entries, 0 to 187
Data columns (total 47 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   country  188 non-null    object 
 1   1970     188 non-null    float64
 2   1971     188 non-null    float64
 3   1972     188 non-null    float64
 4   1973     188 non-null    float64
 5   1974     188 non-null    float64
 6   1975     188 non-null    float64
 7   1976     188 non-null    float64
 8   1977     188 non-null    float64
 9   1978     188 non-null    float64
 10  1979     188 non-null    float64
 11  1980     188 non-null    float64
 12  1981     188 non-null    float64
 13  1982     188 non-null    float64
 14  1983     188 non-null    float64
 15  1984     188 non-null    float64
 16  1985     188 non-null    float64
 17  1986     188 non-null    float64
 18  1987     188 non-null    float64
 19  1988     188 non-null    float64
 20  1989     188 non-null    float64
 21  1990     188 non-null    float64
 22  1991     188 non-null    float64
 23  1992     188 non-null    float64
 24  1993     188 non-null    float64
 25  1994     188 non-null    float64
 26  1995     188 non-null    float64
 27  1996     188 non-null    float64
 28  1997     188 non-null    float64
 29  1998     188 non-null    float64
 30  1999     188 non-null    float64
 31  2000     188 non-null    float64
 32  2001     188 non-null    float64
 33  2002     188 non-null    float64
 34  2003     188 non-null    float64
 35  2004     188 non-null    float64
 36  2005     188 non-null    float64
 37  2006     188 non-null    float64
 38  2007     188 non-null    float64
 39  2008     188 non-null    float64
 40  2009     188 non-null    float64
 41  2010     188 non-null    float64
 42  2011     188 non-null    float64
 43  2012     188 non-null    float64
 44  2013     188 non-null    float64
 45  2014     188 non-null    float64
 46  2015     188 non-null    float64
dtypes: float64(46), object(1)
memory usage: 69.2+ KB
None
Out[10]:
1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 ... 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
count 188.000000 188.000000 188.000000 188.000000 188.000000 188.00000 188.000000 188.000000 188.000000 188.000000 ... 188.000000 188.000000 188.000000 188.000000 188.000000 188.000000 188.000000 188.000000 188.000000 188.000000
mean 74.872340 75.308511 75.735106 76.148404 76.589894 77.01383 77.428191 77.861702 78.284574 78.700000 ... 89.390426 89.669149 89.972340 90.304255 90.625532 90.894681 91.191489 91.460106 91.720213 91.962766
std 23.531585 23.535147 23.561424 23.535533 23.561574 23.56049 23.545932 23.534903 23.514179 23.467923 ... 21.011729 20.870095 20.741023 20.640666 20.540589 20.402434 20.289776 20.143014 20.022101 19.865456
min 11.200000 11.300000 11.400000 11.900000 12.000000 12.00000 12.400000 12.400000 12.800000 13.100000 ... 21.500000 21.900000 22.200000 22.300000 22.600000 22.900000 23.100000 23.400000 23.500000 23.700000
25% 54.700000 55.325000 55.975000 56.500000 57.000000 57.67500 58.300000 58.850000 59.400000 59.950000 ... 78.450000 78.925000 79.475000 79.975000 80.525000 81.050000 81.475000 81.925000 82.425000 82.925000
50% 85.550000 86.000000 86.350000 86.750000 87.150000 87.55000 87.800000 88.200000 88.650000 89.050000 ... 99.350000 99.600000 99.850000 100.000000 100.000000 100.500000 101.000000 101.000000 101.000000 101.000000
75% 93.900000 94.300000 94.725000 95.100000 95.400000 95.90000 96.225000 96.600000 96.825000 97.300000 ... 104.000000 104.000000 104.000000 104.000000 104.250000 105.000000 105.000000 105.000000 105.000000 105.000000
max 129.000000 129.000000 129.000000 129.000000 129.000000 129.00000 130.000000 130.000000 129.000000 130.000000 ... 126.000000 127.000000 126.000000 126.000000 126.000000 126.000000 126.000000 126.000000 126.000000 126.000000

8 rows × 46 columns

All year columns have gender ratio for primary, seconday and tertiary school years attended data as floats for all countries. So, its good to go. Lets check for null values.

In [11]:
df_gend_ratio_pr_sec_ter_yrs.isna().sum()
Out[11]:
country    0
1970       0
1971       0
1972       0
1973       0
1974       0
1975       0
1976       0
1977       0
1978       0
1979       0
1980       0
1981       0
1982       0
1983       0
1984       0
1985       0
1986       0
1987       0
1988       0
1989       0
1990       0
1991       0
1992       0
1993       0
1994       0
1995       0
1996       0
1997       0
1998       0
1999       0
2000       0
2001       0
2002       0
2003       0
2004       0
2005       0
2006       0
2007       0
2008       0
2009       0
2010       0
2011       0
2012       0
2013       0
2014       0
2015       0
dtype: int64
There are no NULL values for this data, so we can move to the next dataset.¶
In [12]:
print(df_emp.info())
df_emp.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 189 entries, 0 to 188
Data columns (total 31 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   country  189 non-null    object 
 1   1991     189 non-null    float64
 2   1992     189 non-null    float64
 3   1993     189 non-null    float64
 4   1994     189 non-null    float64
 5   1995     189 non-null    float64
 6   1996     189 non-null    float64
 7   1997     189 non-null    float64
 8   1998     189 non-null    float64
 9   1999     189 non-null    float64
 10  2000     189 non-null    float64
 11  2001     189 non-null    float64
 12  2002     189 non-null    float64
 13  2003     189 non-null    float64
 14  2004     189 non-null    float64
 15  2005     189 non-null    float64
 16  2006     189 non-null    float64
 17  2007     189 non-null    float64
 18  2008     189 non-null    float64
 19  2009     189 non-null    float64
 20  2010     189 non-null    float64
 21  2011     189 non-null    float64
 22  2012     189 non-null    float64
 23  2013     189 non-null    float64
 24  2014     189 non-null    float64
 25  2015     189 non-null    float64
 26  2016     189 non-null    float64
 27  2017     189 non-null    float64
 28  2018     189 non-null    float64
 29  2019     189 non-null    float64
 30  2020     189 non-null    float64
dtypes: float64(30), object(1)
memory usage: 45.9+ KB
None
Out[12]:
1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 ... 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
count 189.000000 189.000000 189.000000 189.000000 189.000000 189.000000 189.000000 189.000000 189.000000 189.000000 ... 189.000000 189.000000 189.000000 189.000000 189.000000 189.000000 189.000000 189.000000 189.000000 189.000000
mean 58.230159 58.121164 57.801587 57.757143 57.679365 57.511111 57.517989 57.473016 57.360317 57.343386 ... 57.262434 57.297354 57.275132 57.374074 57.521164 57.571429 57.766138 57.940212 58.055556 55.587302
std 11.896678 11.896820 11.946950 11.912027 11.964471 11.964199 11.896070 11.874309 11.832428 11.838322 ... 11.916831 11.924699 11.847428 11.787706 11.686243 11.649396 11.585708 11.586904 11.467977 11.255951
min 33.000000 32.900000 33.100000 32.500000 30.700000 32.100000 33.200000 32.400000 31.400000 30.600000 ... 33.100000 32.900000 32.400000 31.900000 32.300000 32.600000 32.700000 32.100000 32.800000 30.900000
25% 51.100000 50.400000 49.900000 50.100000 49.500000 49.500000 49.400000 49.600000 48.800000 49.100000 ... 50.000000 50.100000 49.700000 49.700000 49.800000 49.500000 50.100000 50.500000 50.400000 48.600000
50% 57.200000 56.900000 56.800000 57.100000 56.500000 56.400000 56.500000 56.500000 56.300000 56.500000 ... 57.200000 57.400000 57.300000 57.200000 57.800000 57.600000 58.000000 58.400000 58.300000 55.600000
75% 64.800000 64.100000 64.600000 64.700000 64.800000 64.700000 64.600000 64.500000 64.700000 64.600000 ... 64.200000 64.300000 63.800000 64.100000 64.300000 64.300000 64.400000 65.100000 65.100000 62.100000
max 90.500000 89.800000 88.700000 87.800000 86.500000 86.400000 86.100000 85.800000 85.400000 85.000000 ... 86.900000 87.800000 87.000000 87.000000 87.800000 87.200000 86.700000 86.600000 86.700000 83.300000

8 rows × 30 columns

All year columns have employement % for 15+ aged people as floats for all countries. So, its good to go. Let's check for NULL values now.

In [13]:
df_emp.isna().sum()
Out[13]:
country    0
1991       0
1992       0
1993       0
1994       0
1995       0
1996       0
1997       0
1998       0
1999       0
2000       0
2001       0
2002       0
2003       0
2004       0
2005       0
2006       0
2007       0
2008       0
2009       0
2010       0
2011       0
2012       0
2013       0
2014       0
2015       0
2016       0
2017       0
2018       0
2019       0
2020       0
dtype: int64
There are no NULL values for this data, so we can move to the income dataset.¶
In [14]:
print(df_income_per_person.info())
df_income_per_person.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Columns: 252 entries, country to 2050
dtypes: int64(101), object(151)
memory usage: 384.0+ KB
None
Out[14]:
1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 ... 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900
count 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 ... 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000
mean 1072.179487 1071.353846 1074.210256 1073.753846 1075.430769 1075.174359 1076.169231 1076.251282 1066.497436 1067.553846 ... 1796.825641 1819.466667 1839.246154 1873.189744 1886.882051 1917.348718 1924.005128 1957.410256 1981.030769 1987.743590
std 605.373041 602.816015 614.245612 608.330435 617.123397 610.442792 612.281097 606.053716 564.608763 569.926448 ... 1431.441055 1433.272527 1418.456847 1441.871804 1476.487543 1518.852409 1516.524093 1559.755626 1604.763852 1616.979912
min 224.000000 224.000000 224.000000 224.000000 225.000000 225.000000 225.000000 225.000000 225.000000 225.000000 ... 334.000000 337.000000 339.000000 342.000000 345.000000 348.000000 351.000000 354.000000 357.000000 360.000000
25% 680.500000 680.500000 680.500000 680.500000 680.500000 680.500000 680.500000 680.500000 680.500000 680.500000 ... 883.000000 903.000000 932.500000 935.500000 939.000000 948.000000 948.500000 950.500000 955.000000 957.000000
50% 959.000000 959.000000 959.000000 959.000000 954.000000 954.000000 956.000000 960.000000 960.000000 960.000000 ... 1250.000000 1290.000000 1290.000000 1300.000000 1300.000000 1310.000000 1310.000000 1320.000000 1320.000000 1330.000000
75% 1240.000000 1240.000000 1240.000000 1245.000000 1250.000000 1250.000000 1255.000000 1255.000000 1255.000000 1255.000000 ... 2185.000000 2220.000000 2250.000000 2305.000000 2340.000000 2350.000000 2335.000000 2430.000000 2455.000000 2410.000000
max 4860.000000 4780.000000 5040.000000 4930.000000 5170.000000 4860.000000 4900.000000 4490.000000 3780.000000 3850.000000 ... 8650.000000 8510.000000 7940.000000 7960.000000 8320.000000 8380.000000 8610.000000 8720.000000 9250.000000 9340.000000

8 rows × 101 columns

Income per person data has all expected float/ integer fields as objects (strings). This is due to the presence of K in the numbers denoting thousands. So a value of 6.6k for 2017 in Australia means its 66000.00.
Let's convert the numbers to their true values by replacing them with their integer values and again run the info and describe methods to ensure they are in correct datatypes.

In [15]:
for column in df_income_per_person.columns[1:]:
    is_thousand = df_income_per_person[column].astype(str).str.contains('k', na = False)
    df_income_per_person[column].replace("k$","", regex = True, inplace = True)
    df_income_per_person[column] = pd.to_numeric(df_income_per_person[column])
    df_income_per_person[column] = np.where(is_thousand, df_income_per_person[column] *1000, df_income_per_person[column])

print(df_income_per_person.info())    
df_income_per_person.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Columns: 252 entries, country to 2050
dtypes: float64(150), int64(101), object(1)
memory usage: 384.0+ KB
None
Out[15]:
1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 ... 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050
count 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 ... 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000
mean 1072.179487 1071.353846 1074.210256 1073.753846 1075.430769 1075.174359 1076.169231 1076.251282 1066.497436 1067.553846 ... 31265.897436 31926.974359 32628.666667 33321.487179 34038.717949 34775.692308 35524.307692 36288.205128 37069.538462 37862.666667
std 605.373041 602.816015 614.245612 608.330435 617.123397 610.442792 612.281097 606.053716 564.608763 569.926448 ... 32907.191950 33561.834556 34317.547783 35048.008199 35799.737313 36575.963727 37360.004510 38162.975396 38986.791899 39818.346586
min 224.000000 224.000000 224.000000 224.000000 225.000000 225.000000 225.000000 225.000000 225.000000 225.000000 ... 1050.000000 1070.000000 1100.000000 1120.000000 1140.000000 1170.000000 1190.000000 1220.000000 1240.000000 1270.000000
25% 680.500000 680.500000 680.500000 680.500000 680.500000 680.500000 680.500000 680.500000 680.500000 680.500000 ... 6550.000000 6690.000000 6840.000000 6985.000000 7135.000000 7290.000000 7445.000000 7610.000000 7770.000000 7935.000000
50% 959.000000 959.000000 959.000000 959.000000 954.000000 954.000000 956.000000 960.000000 960.000000 960.000000 ... 20000.000000 20400.000000 20900.000000 21300.000000 21800.000000 22200.000000 22700.000000 23200.000000 23700.000000 24200.000000
75% 1240.000000 1240.000000 1240.000000 1245.000000 1250.000000 1250.000000 1255.000000 1255.000000 1255.000000 1255.000000 ... 49150.000000 50250.000000 51300.000000 52400.000000 53550.000000 54750.000000 55900.000000 57100.000000 58350.000000 59550.000000
max 4860.000000 4780.000000 5040.000000 4930.000000 5170.000000 4860.000000 4900.000000 4490.000000 3780.000000 3850.000000 ... 173000.000000 176000.000000 180000.000000 184000.000000 188000.000000 192000.000000 196000.000000 200000.000000 205000.000000 209000.000000

8 rows × 251 columns

The struture is now similar to other dataframes with country and years as columns. This is a little different due to the presence of multiindex but it can be used to select relavant dataeasily. All year columns have income per person data as floats for all countries after the transformations. So, its good to go. Lets check for NULL values now.

In [16]:
df_income_per_person.isna().sum()
Out[16]:
country    0
1800       0
1801       0
1802       0
1803       0
          ..
2046       0
2047       0
2048       0
2049       0
2050       0
Length: 252, dtype: int64
There are no NULL values for this data. This marks the end of cleaning datasets.¶

Exploratory Data Analysis¶

Go back to top¶

Research Question 1 : What is the gender ratio of students who enroll in primary and secondary education for least 5 gross income countries?¶

Before anything, we need to make sure the countries present in gender ratio for school enrollment has same countries as the income dataset.¶
In [17]:
#get sets of both countries and find the common among both using interection of sets
countries_gend_ratio_pr_sec_enrollment = set(df_gend_ratio_pr_sec_enrollment.country)
print(f"No of countries in enrollment dataset is {len(countries_gend_ratio_pr_sec_enrollment)}.")
countries_income_per_person = set(df_income_per_person.country)
print(f"No of countries in income dataset is {len(countries_income_per_person)}.")

common_country_list_rq1 = list(countries_income_per_person.intersection(countries_gend_ratio_pr_sec_enrollment))
print(f"No of common countries among enrollment and income datasets are {len(common_country_list_rq1)}.")

#filter both dataframes to keep only those rows corresponding to common countries
df_gend_ratio_enrollment_rq1 = df_gend_ratio_pr_sec_enrollment[df_gend_ratio_pr_sec_enrollment['country'].isin(common_country_list_rq1)].reset_index(drop = True)
df_income_per_person_rq1 = df_income_per_person[df_income_per_person['country'].isin(common_country_list_rq1)].reset_index(drop = True)
No of countries in enrollment dataset is 204.
No of countries in income dataset is 195.
No of common countries among enrollment and income datasets are 190.
Now lets choose the range of years by checking both datasets, finding a common range and choosing the last 5 years to get a recent trend.¶
In [18]:
years_gend_ratio_pr_sec_enrollment = df_gend_ratio_pr_sec_enrollment.columns[1:]
print(f"Range of years in enrollment dataset is {years_gend_ratio_pr_sec_enrollment[0]}-{years_gend_ratio_pr_sec_enrollment[-1]}.")
years_income_per_person = df_income_per_person.columns[1:]
print(f"Range of years in income dataset is {years_income_per_person[0]}-{years_income_per_person[-1]}.")
Range of years in enrollment dataset is 2000-2020.
Range of years in income dataset is 1800-2050.
Based on commonality, I choose 2015 to 2020 data for my analysis as its themost recent data and common in both¶
In [19]:
year_list_rq1 = list(map(str, range(2015, 2021)))
year_list_rq1.insert(0, 'country')
df_gend_ratio_enrollment_rq1 = df_gend_ratio_enrollment_rq1[year_list_rq1].sort_values("country").reset_index(drop= True)
df_income_per_person_rq1 = df_income_per_person_rq1[year_list_rq1].sort_values("country").reset_index(drop= True)
display(df_gend_ratio_enrollment_rq1.head())
display(df_income_per_person_rq1.head())
country 2015 2016 2017 2018 2019 2020
0 Afghanistan 0.642000 0.646000 0.642000 0.636000 0.437389 0.437389
1 Albania 0.982000 0.994000 1.000000 1.020000 1.020000 1.020000
2 Algeria 0.833923 0.833923 0.833923 0.833923 0.833923 0.833923
3 Andorra 1.090000 1.090000 1.090000 1.090000 1.090000 1.090000
4 Angola 0.754889 0.754889 0.754889 0.754889 0.754889 0.754889
country 2015 2016 2017 2018 2019 2020
0 Afghanistan 2070.0 2060.0 2060.0 2030.0 2070.0 1970.0
1 Albania 11900.0 12300.0 12800.0 13300.0 13700.0 13200.0
2 Algeria 11700.0 11800.0 11700.0 11600.0 11500.0 10900.0
3 Andorra 52700.0 54500.0 56300.0 58300.0 58400.0 51600.0
4 Angola 8040.0 7570.0 7310.0 6930.0 6670.0 6120.0
Lets find the 5 least grossing countries¶

For this we need to find the mean income for each country, sort them in ascending order and choose the top 5.

In [20]:
df_income_per_person_rq1['mean_income'] = df_income_per_person_rq1.iloc[:,1:].mean(axis= 1)
In [21]:
lowest_income_5 = df_income_per_person_rq1.sort_values('mean_income').head().sort_values("country")
lowest_income_5
Out[21]:
country 2015 2016 2017 2018 2019 2020 mean_income
26 Burundi 825.0 795.0 774.0 762.0 752.0 732.0 773.333333
31 Central African Republic 853.0 884.0 913.0 933.0 945.0 937.0 910.833333
37 Congo, Dem. Rep. 1070.0 1060.0 1060.0 1090.0 1100.0 1080.0 1076.666667
121 Niger 1130.0 1150.0 1160.0 1200.0 1220.0 1220.0 1180.000000
152 Somalia 1030.0 1080.0 1080.0 1130.0 1190.0 1150.0 1110.000000
Let's plot the incomes to see the range and how theleast 5 compare to the complete set¶
In [22]:
fig = px.bar(df_income_per_person_rq1, x= "country", y = 'mean_income', 
              title='Income per person in all countries', 
              color = "country", 
             labels = {"mean_income": "Mean Income (International Dollars)", "country": "Country"},
            color_discrete_sequence = px.colors.qualitative.Pastel
            )
fig.add_trace(go.Scatter(x=lowest_income_5.country, y=lowest_income_5["mean_income"],
                    mode='markers',
                    name='Least Income', line=dict(color='firebrick')))
fig.show() 
In [23]:
lowest_income_gend_ratio_enrollment = df_gend_ratio_enrollment_rq1[df_gend_ratio_enrollment_rq1['country'].isin(lowest_income_5.country)]
lowest_income_gend_ratio_enrollment = lowest_income_gend_ratio_enrollment.set_index('country').unstack().reset_index().rename(columns = {'level_0': 'year', 0: 'enrollmentRatio'}).sort_values(["country", "year"])[['country', 'year', 'enrollmentRatio']].reset_index(drop= True)
lowest_income_gend_ratio_enrollment
Out[23]:
country year enrollmentRatio
0 Burundi 2015 1.000000
1 Burundi 2016 1.010000
2 Burundi 2017 1.020000
3 Burundi 2018 1.030000
4 Burundi 2019 1.040000
5 Burundi 2020 0.777000
6 Central African Republic 2015 0.578955
7 Central African Republic 2016 0.761000
8 Central African Republic 2017 0.578955
9 Central African Republic 2018 0.578955
10 Central African Republic 2019 0.578955
11 Central African Republic 2020 0.578955
12 Congo, Dem. Rep. 2015 0.889000
13 Congo, Dem. Rep. 2016 0.685452
14 Congo, Dem. Rep. 2017 0.685452
15 Congo, Dem. Rep. 2018 0.685452
16 Congo, Dem. Rep. 2019 0.685452
17 Congo, Dem. Rep. 2020 0.685452
18 Niger 2015 0.825000
19 Niger 2016 0.831000
20 Niger 2017 0.838000
21 Niger 2018 0.639667
22 Niger 2019 0.639667
23 Niger 2020 0.639667
24 Somalia 2015 0.483563
25 Somalia 2016 0.483563
26 Somalia 2017 0.483563
27 Somalia 2018 0.483563
28 Somalia 2019 0.483563
29 Somalia 2020 0.483563
Lets find the average gender ratio of enrollment for all 190 countries in the list to compare it with the lowest income countries.¶
In [24]:
mean_gend_ratio_enrollment_per_year = df_gend_ratio_enrollment_rq1.iloc[:,1:].mean(axis= 0).to_frame().reset_index().rename(columns = {'index': 'year', 0: 'meanEnrollmentRatio'})
mean_gend_ratio_enrollment_per_year
Out[24]:
year meanEnrollmentRatio
0 2015 0.954438
1 2016 0.954636
2 2017 0.955408
3 2018 0.950834
4 2019 0.946897
5 2020 0.941240
In [25]:
fig = px.line(lowest_income_gend_ratio_enrollment, x= "year", y = 'enrollmentRatio', 
              title='Gender ratio of students for least 5 gross income countries', 
              color = "country",
              color_discrete_sequence= [px.colors.qualitative.Pastel[1],
                                        px.colors.qualitative.Pastel[7],
                                        px.colors.qualitative.Pastel[3],
                                        px.colors.qualitative.Pastel[4],
                                        px.colors.qualitative.Pastel[5]
                                       ], 
              markers=True, symbol="country",
              labels={
                     "year": "Year",
                     "enrollmentRatio": "Girls to Boys Enrollment Ratio",
                     "country": "Country"
              })
fig.add_trace(go.Scatter(x=mean_gend_ratio_enrollment_per_year.year, y=mean_gend_ratio_enrollment_per_year.meanEnrollmentRatio,
                    mode='lines+markers',
                    name='Mean Enrollment Gender Ratio', line=dict(color='firebrick')))
fig.show() 

Research Question 2 : Does employement ratio get affected based on more no of years spent in school?¶

Go back to top¶
Before anything, we need to make sure the countries present in gender ratio for school years has same countries as the employement rate dataset.¶
In [26]:
#get sets of both countries and find the common among both using interection of sets
countries_emp = set(df_emp.country)
print(f"No of countries in employement rate dataset is {len(countries_emp)}.")
countries_gend_ratio_pr_sec_ter_yrs = set(df_gend_ratio_pr_sec_ter_yrs.country)
print(f"No of countries in no of school years dataset is {len(countries_gend_ratio_pr_sec_ter_yrs)}.")

common_country_list_rq2 = list(countries_gend_ratio_pr_sec_ter_yrs.intersection(countries_emp))
print(f"No of common countries among enrollment and income datasets are {len(common_country_list_rq2)}.")

#filter both dataframes to keep only those rows corresponding to common countries
df_emp_rq2 = df_emp[df_emp['country'].isin(common_country_list_rq2)].reset_index(drop = True)
df_gend_ratio_pr_sec_ter_yrs_rq2 = df_gend_ratio_pr_sec_ter_yrs[df_gend_ratio_pr_sec_ter_yrs['country'].isin(common_country_list_rq2)].reset_index(drop = True)
No of countries in employement rate dataset is 189.
No of countries in no of school years dataset is 188.
No of common countries among enrollment and income datasets are 180.
Now lets choose the range of years by checking both datasets, finding a common range and choosing the last 5 years to get a recent trend.¶
In [27]:
years_emp = df_emp.columns[1:]
print(f"Range of years in employment rates dataset is {years_emp[0]}-{years_emp[-1]}.")
years_gend_ratio_pr_sec_ter_yrs = df_gend_ratio_pr_sec_ter_yrs.columns[1:]
print(f"Range of years in school years dataset is {years_gend_ratio_pr_sec_ter_yrs[0]}-{years_gend_ratio_pr_sec_ter_yrs[-1]}.")
Range of years in employment rates dataset is 1991-2020.
Range of years in school years dataset is 1970-2015.
Based on commonality, I choose 1991 to 2015 data for my analysis as its themost recent data and common in both¶
In [28]:
year_list_rq2 = list(map(str, range(1991, 2016)))
year_list_rq2.insert(0, 'country')
df_emp_rq2 = df_emp_rq2[year_list_rq2].sort_values("country").reset_index(drop= True)
df_gend_ratio_pr_sec_ter_yrs_rq2 = df_gend_ratio_pr_sec_ter_yrs_rq2[year_list_rq2].sort_values("country").reset_index(drop= True)
display(df_emp_rq2.head())
display(df_gend_ratio_pr_sec_ter_yrs_rq2.head())
country 1991 1992 1993 1994 1995 1996 1997 1998 1999 ... 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
0 Afghanistan 42.5 42.5 42.5 42.5 42.4 42.4 42.3 42.2 42.2 ... 42.9 42.8 42.7 42.4 42.2 42.3 42.4 42.5 42.7 42.9
1 Albania 57.8 58.2 56.8 55.7 54.1 53.3 54.5 53.8 52.7 ... 47.4 46.9 47.9 47.1 46.6 52.0 49.4 44.7 43.7 46.0
2 Algeria 35.4 33.7 33.1 32.5 30.7 32.1 33.2 32.4 31.4 ... 37.2 36.5 37.4 37.8 38.1 38.3 38.0 39.4 37.3 37.2
3 Angola 75.0 75.0 75.2 75.1 74.9 74.9 74.8 74.7 74.6 ... 74.2 74.3 74.3 74.3 70.0 71.7 71.8 71.8 71.9 71.9
4 Argentina 57.3 56.9 54.9 54.0 49.5 50.7 52.5 54.1 53.1 ... 56.0 56.3 56.0 55.8 55.7 56.3 56.1 56.0 55.4 55.5

5 rows × 26 columns

country 1991 1992 1993 1994 1995 1996 1997 1998 1999 ... 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
0 Afghanistan 18.5 18.9 18.9 19.2 19.1 19.4 19.7 19.9 20.1 ... 21.5 21.9 22.2 22.3 22.6 22.9 23.1 23.4 23.5 23.7
1 Albania 96.1 96.5 96.9 97.1 97.4 97.8 98.1 98.5 98.7 ... 100.0 101.0 101.0 101.0 101.0 102.0 102.0 102.0 102.0 103.0
2 Algeria 87.7 87.3 86.6 85.9 85.4 85.4 85.6 85.6 85.8 ... 87.6 88.0 88.3 88.7 89.0 89.4 89.7 90.0 90.5 90.8
3 Angola 60.4 60.8 61.2 61.8 62.4 62.8 63.5 63.9 64.5 ... 68.5 68.9 69.5 70.1 70.5 71.2 71.7 72.2 72.9 73.3
4 Argentina 104.0 104.0 105.0 105.0 105.0 105.0 106.0 106.0 106.0 ... 107.0 107.0 107.0 108.0 108.0 108.0 108.0 108.0 108.0 108.0

5 rows × 26 columns

To understand the relationship (positive correlation or negative) among gender ratio of school years and employement rate, and how it changed over these years, we need to find the correlation and next will be comparing it with the mean of gender ratio of number of school years. We will analyze if they both are in same correlation or different.

In [29]:
emp_gend_ratio_yrs_corr = df_emp_rq2.corrwith(df_gend_ratio_pr_sec_ter_yrs_rq2, axis = 0).to_frame().reset_index().rename(columns = {'index': 'year', 0: 'correlationValue'})
emp_gend_ratio_yrs_corr
Out[29]:
year correlationValue
0 1991 -0.211374
1 1992 -0.215694
2 1993 -0.226082
3 1994 -0.229419
4 1995 -0.226484
5 1996 -0.236006
6 1997 -0.230684
7 1998 -0.225490
8 1999 -0.223718
9 2000 -0.216136
10 2001 -0.209794
11 2002 -0.209221
12 2003 -0.205498
13 2004 -0.194099
14 2005 -0.175256
15 2006 -0.154981
16 2007 -0.133490
17 2008 -0.120964
18 2009 -0.140309
19 2010 -0.140064
20 2011 -0.128902
21 2012 -0.112533
22 2013 -0.106454
23 2014 -0.089712
24 2015 -0.081037
In [30]:
#finding mean of correlation for all years
emp_gend_ratio_yrs_corr['meanRatio'] = df_gend_ratio_pr_sec_ter_yrs_rq2.iloc[:,1:].mean(axis= 0).to_frame().reset_index(drop=True).rename(columns = { 0: 'meanSchoolYearsRatio'})

#standardizing the mean as per the scale of correlationValue for better visualization
emp_gend_ratio_yrs_corr['standardizedMeanRatio'] = ((emp_gend_ratio_yrs_corr['meanRatio'] - emp_gend_ratio_yrs_corr['meanRatio'].min()) * 0.15 )/ (emp_gend_ratio_yrs_corr['meanRatio'].max() - emp_gend_ratio_yrs_corr['meanRatio'].min()) - 0.24
emp_gend_ratio_yrs_corr
Out[30]:
year correlationValue meanRatio standardizedMeanRatio
0 1991 -0.211374 83.071111 -0.240000
1 1992 -0.215694 83.503889 -0.232181
2 1993 -0.226082 83.945000 -0.224211
3 1994 -0.229419 84.351111 -0.216874
4 1995 -0.226484 84.733889 -0.209958
5 1996 -0.236006 85.145000 -0.202530
6 1997 -0.230684 85.548333 -0.195243
7 1998 -0.225490 85.941111 -0.188146
8 1999 -0.223718 86.291667 -0.181813
9 2000 -0.216136 86.682222 -0.174756
10 2001 -0.209794 87.045000 -0.168202
11 2002 -0.209221 87.411667 -0.161577
12 2003 -0.205498 87.713889 -0.156117
13 2004 -0.194099 88.073889 -0.149613
14 2005 -0.175256 88.424444 -0.143279
15 2006 -0.154981 88.756111 -0.137287
16 2007 -0.133490 89.038889 -0.132177
17 2008 -0.120964 89.352222 -0.126516
18 2009 -0.140309 89.685000 -0.120504
19 2010 -0.140064 90.006111 -0.114702
20 2011 -0.128902 90.285000 -0.109663
21 2012 -0.112533 90.588333 -0.104183
22 2013 -0.106454 90.851111 -0.099435
23 2014 -0.089712 91.121111 -0.094557
24 2015 -0.081037 91.373333 -0.090000

Let's plot the yearly trend to see how correlation changed accross 1991 to 2015.

In [31]:
fig = px.line(emp_gend_ratio_yrs_corr, x= "year", y = 'correlationValue', 
              title='Yearly trend of relationship between employement rate and school enrollment ratio',  
             labels = {"correlationValue": "Correlation Value", "country": "Country", "year": "Year"},
            color_discrete_sequence = px.colors.qualitative.Pastel, 
              markers = True
            )
fig.add_trace(go.Scatter(x=emp_gend_ratio_yrs_corr.year, y=emp_gend_ratio_yrs_corr.standardizedMeanRatio,
                    mode='lines+markers',
#                     name='Mean School years Ratio', 
                         line=dict(color='firebrick')))
fig.layout.update(showlegend=False)
fig.show() 

Next, lets observe the countrywise correlation between school years and employement rate and see where the impact has most positive correlation and where is it most negative. This helps us visualize where in our world improvement in ratio has a positive/ negative impact on employement.

In [32]:
common_country_list_rq2.sort()
common_country_dict_rq2 = {k: v for k, v in enumerate(common_country_list_rq2)}
common_country_dict_rq2
Out[32]:
{0: 'Afghanistan',
 1: 'Albania',
 2: 'Algeria',
 3: 'Angola',
 4: 'Argentina',
 5: 'Armenia',
 6: 'Australia',
 7: 'Austria',
 8: 'Azerbaijan',
 9: 'Bahamas',
 10: 'Bahrain',
 11: 'Bangladesh',
 12: 'Barbados',
 13: 'Belarus',
 14: 'Belgium',
 15: 'Belize',
 16: 'Benin',
 17: 'Bhutan',
 18: 'Bolivia',
 19: 'Bosnia and Herzegovina',
 20: 'Botswana',
 21: 'Brazil',
 22: 'Brunei',
 23: 'Bulgaria',
 24: 'Burkina Faso',
 25: 'Burundi',
 26: 'Cambodia',
 27: 'Cameroon',
 28: 'Canada',
 29: 'Cape Verde',
 30: 'Central African Republic',
 31: 'Chad',
 32: 'Chile',
 33: 'China',
 34: 'Colombia',
 35: 'Comoros',
 36: 'Congo, Dem. Rep.',
 37: 'Congo, Rep.',
 38: 'Costa Rica',
 39: "Cote d'Ivoire",
 40: 'Croatia',
 41: 'Cuba',
 42: 'Cyprus',
 43: 'Czech Republic',
 44: 'Denmark',
 45: 'Djibouti',
 46: 'Dominican Republic',
 47: 'Ecuador',
 48: 'Egypt',
 49: 'El Salvador',
 50: 'Equatorial Guinea',
 51: 'Eritrea',
 52: 'Estonia',
 53: 'Eswatini',
 54: 'Ethiopia',
 55: 'Fiji',
 56: 'Finland',
 57: 'France',
 58: 'Gabon',
 59: 'Gambia',
 60: 'Georgia',
 61: 'Germany',
 62: 'Ghana',
 63: 'Greece',
 64: 'Guatemala',
 65: 'Guinea',
 66: 'Guinea-Bissau',
 67: 'Guyana',
 68: 'Haiti',
 69: 'Honduras',
 70: 'Hungary',
 71: 'Iceland',
 72: 'India',
 73: 'Indonesia',
 74: 'Iran',
 75: 'Iraq',
 76: 'Ireland',
 77: 'Israel',
 78: 'Italy',
 79: 'Jamaica',
 80: 'Japan',
 81: 'Jordan',
 82: 'Kazakhstan',
 83: 'Kenya',
 84: 'Kuwait',
 85: 'Kyrgyz Republic',
 86: 'Lao',
 87: 'Latvia',
 88: 'Lebanon',
 89: 'Lesotho',
 90: 'Liberia',
 91: 'Libya',
 92: 'Lithuania',
 93: 'Luxembourg',
 94: 'Madagascar',
 95: 'Malawi',
 96: 'Malaysia',
 97: 'Maldives',
 98: 'Mali',
 99: 'Malta',
 100: 'Mauritania',
 101: 'Mauritius',
 102: 'Mexico',
 103: 'Moldova',
 104: 'Mongolia',
 105: 'Montenegro',
 106: 'Morocco',
 107: 'Mozambique',
 108: 'Myanmar',
 109: 'Namibia',
 110: 'Nepal',
 111: 'Netherlands',
 112: 'New Zealand',
 113: 'Nicaragua',
 114: 'Niger',
 115: 'Nigeria',
 116: 'North Korea',
 117: 'North Macedonia',
 118: 'Norway',
 119: 'Oman',
 120: 'Pakistan',
 121: 'Palestine',
 122: 'Panama',
 123: 'Papua New Guinea',
 124: 'Paraguay',
 125: 'Peru',
 126: 'Philippines',
 127: 'Poland',
 128: 'Portugal',
 129: 'Qatar',
 130: 'Romania',
 131: 'Russia',
 132: 'Rwanda',
 133: 'Samoa',
 134: 'Sao Tome and Principe',
 135: 'Saudi Arabia',
 136: 'Senegal',
 137: 'Serbia',
 138: 'Sierra Leone',
 139: 'Singapore',
 140: 'Slovak Republic',
 141: 'Slovenia',
 142: 'Solomon Islands',
 143: 'Somalia',
 144: 'South Africa',
 145: 'South Korea',
 146: 'South Sudan',
 147: 'Spain',
 148: 'Sri Lanka',
 149: 'St. Lucia',
 150: 'St. Vincent and the Grenadines',
 151: 'Sudan',
 152: 'Suriname',
 153: 'Sweden',
 154: 'Switzerland',
 155: 'Syria',
 156: 'Taiwan',
 157: 'Tajikistan',
 158: 'Tanzania',
 159: 'Thailand',
 160: 'Timor-Leste',
 161: 'Togo',
 162: 'Tonga',
 163: 'Trinidad and Tobago',
 164: 'Tunisia',
 165: 'Turkey',
 166: 'Turkmenistan',
 167: 'Uganda',
 168: 'Ukraine',
 169: 'United Arab Emirates',
 170: 'United Kingdom',
 171: 'United States',
 172: 'Uruguay',
 173: 'Uzbekistan',
 174: 'Vanuatu',
 175: 'Venezuela',
 176: 'Vietnam',
 177: 'Yemen',
 178: 'Zambia',
 179: 'Zimbabwe'}
In [33]:
emp_gend_ratio_yrs_country_corr = df_emp_rq2.corrwith(df_gend_ratio_pr_sec_ter_yrs_rq2, axis = 1).to_frame().reset_index().rename(columns = {'index': 'country', 0: 'correlationValue'})
emp_gend_ratio_yrs_country_corr.replace(common_country_dict_rq2, inplace = True)
emp_gend_ratio_yrs_country_corr
Out[33]:
country correlationValue
0 Afghanistan 0.323407
1 Albania -0.895268
2 Algeria 0.895060
3 Angola -0.809776
4 Argentina 0.243432
... ... ...
175 Venezuela 0.345778
176 Vietnam 0.152598
177 Yemen -0.934838
178 Zambia 0.602975
179 Zimbabwe 0.385582

180 rows × 2 columns

Let's see the most positively correlated countries to most negatively correlated countries. We can further analyse what happened in those countries to understand the trend better

In [34]:
high_neg_corr = emp_gend_ratio_yrs_country_corr.sort_values("correlationValue").head()
high_pos_corr = emp_gend_ratio_yrs_country_corr.sort_values("correlationValue").tail()
display(high_neg_corr)
display(high_pos_corr)
country correlationValue
143 Somalia -0.993291
24 Burkina Faso -0.985139
33 China -0.984672
108 Myanmar -0.981745
100 Mauritania -0.979770
country correlationValue
150 St. Vincent and the Grenadines 0.954454
88 Lebanon 0.955298
41 Cuba 0.959584
125 Peru 0.972933
91 Libya 0.987849
In [36]:
clrred = 'rgb(222, 0, 0)'
clrgrn = 'rgb(0, 222, 0)'
clrs  = [clrred if c in list(high_neg_corr.country) else clrgrn if c in list(high_pos_corr.country) else px.colors.qualitative.Pastel[1] for c in emp_gend_ratio_yrs_country_corr.country]

fig = go.Figure(data=[go.Bar( x= emp_gend_ratio_yrs_country_corr["country"], y = emp_gend_ratio_yrs_country_corr['correlationValue'], 
               
             
            marker_color = clrs
            )])
fig.update_layout(xaxis_tickangle=-45, title='Correlation of Employement Rate and Gender Ratio of School Years', xaxis_title="Country",
    yaxis_title="Correlation")
fig.show() 

The green and red bars denote the highest positive correlation and lowest negative correlations among the 180 countries.

Conclusions¶

Go back to top¶

Based on our plots, the following points can be concluded-¶

  1. Burundi, Central African Republic, Congo, Democatic Republic, Niger and Somalia have the least income and whats more is all of them lie in the African continent which shows that its one of the poorest continents of the world.
  2. When looking at these 5 low income countries, all except Burundi have the girls to boys enrollment ratio lesser than the world average. This is very interesting to observe as Burundi has the lowest income per person in the world. This shows that the women empowerment can be seen in it, since even though they earn less they have far more better ways of moving forward and that should be learnt by all. Only for the year 2020, Burundi went down the average majorly due to the effect of Covid.
  3. Somalia has the least enrollment ratio in the whole world. And the ratio has been consistent among the last 5 years.
  4. A common trend of decreased enrollment in schools can be spotted worldwide (by looking the world aveage) in 2020. The effect of Covid on education can be observed through this. The highest decline happened in Burundi with a 25% decrease in the enrollment ratio.
  5. The second highest decline in enrollment ratio happened in Congo and it didn't get any better till now.
  6. The correlation between employement rate and gender ratio of school years is mostly negative. But what is worth noticing is that its becoming less negative, means the inclusion of more no of years spent by women in the school has a positive relationship with employement rates. Hence, as more women spend more years in school, they are becoming better workforce and improving the employement workforce.
  7. The trend of avergae employement rates increase year by year. And the correlation also follows the same trend showing that both follow a similar trend.
  8. A worth-noticing point is a drop in the correlation during the year of 2009 and 2010. On research, I found that this is due to global recession that happened in 2008. The employement rates increased yet people wouldn't have the finances to onboard students in schools hence, reducing the ratio and the correlation.
  9. Libya, Cuba, Lebenon, Peru and St. Vincent and the Grenadines are the ones with most positive relationship with employability rates and gender ratio of school years. The opposite was seen in Burkina Faso, China, Mauritiana, Myanmar and Somalia. Somalia is one of the lowest income countries as well. China is the most surprising entry here.

On whole, the education state of the world is improving and the related employability situation as well.

That's all!